Purpose

This document is required to indicate where various requirements can be found within your Final Project Report Rmd. You must indicate line numbers as they appear in your final Rmd document accompanying each of the following required tasks. Points will be deducted if line numbers are missing or differ signficantly from the submitted Final Rmd document.

Final Project Requirements

Data Access

Description: (1) Analysis includes at least two different data sources. (2) Primary data source may NOT be loaded from an R package–though supporting data may. (3) Access to all data sources is contained within the analysis. (4) Imported data is inspected at beginning of analysis using one or more R functions: e.g., str, glimpse, head, tail, names, nrow, etc

  1. .Rmd Line numbers where at least two different data sources are imported:
rm(list = ls())
library(rvest)
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.3.2
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr   1.1.2     ✔ readr   2.1.4
## ✔ forcats 1.0.0     ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()         masks stats::filter()
## ✖ readr::guess_encoding() masks rvest::guess_encoding()
## ✖ dplyr::lag()            masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(DataComputing)
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
#data("SanFranPublicBathrooms.csv")

sanfran=read.csv("SanFranPublicBathrooms.csv", header=TRUE)
head(sanfran)
library(remotes)
install_github("yonghah/esri2sf")
## Skipping install of 'esri2sf' from a github remote, the SHA1 (f47eda53) has not changed since last install.
##   Use `force = TRUE` to force installation
library("esri2sf")
url <- "https://services.arcgis.com/QPnoxtBFXm6yYtyc/ArcGIS/rest/services/SanFranciscoPoopMap_WFL1/FeatureServer/2"
poop <- esri2sf(url, objectIds = paste(collapse = ","))
## Layer Type: Feature Layer
## Geometry Type: esriGeometryPoint
## Service Coordinate Reference System: 3857
## Output Coordinate Reference System: 4326
  1. .Rmd Line numbers for inspecting data intake:
str(poop)
## Classes 'sf' and 'data.frame':   39949 obs. of  15 variables:
##  $ OBJECTID      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Date          : int  43466 43467 43599 43468 40177 40177 40177 43469 40177 43473 ...
##  $ RequestType   : chr  "Human or Animal Waste" "Human or Animal Waste" "Human or Animal Waste" "Human or Animal Waste" ...
##  $ Address       : chr  "2048 BRYANT ST, SAN FRANCISCO, CA, 94110" "18 ADAIR ST, SAN FRANCISCO, CA, 94103" "400 GOLDEN GATE AVE, SAN FRANCISCO, CA, 94102" "15 SHARON ST, SAN FRANCISCO, CA, 94114" ...
##  $ street        : chr  "BRYANT ST" "ADAIR ST" "GOLDEN GATE AVE" "SHARON ST" ...
##  $ supervisor    : int  9 9 6 8 3 3 3 9 6 3 ...
##  $ Neighborhood  : chr  "Mission" "Mission" "Civic Center" "Mission Dolores" ...
##  $ policedistrict: chr  "MISSION" "MISSION" "TENDERLOIN" "MISSION" ...
##  $ latitude      : num  37.8 37.8 37.8 37.8 37.8 ...
##  $ longitude     : num  -122 -122 -122 -122 -122 ...
##  $ source        : chr  "Mobile/Open311" "Mobile/Open311" "Mobile/Open311" "Mobile/Open311" ...
##  $ URL           : chr  "http://mobile311.sfgov.org/reports/10302217/photos" "http://mobile311.sfgov.org/reports/10304991/photos" "http://mobile311.sfgov.org/reports/10858769/photos" "http://mobile311.sfgov.org/reports/10311333/photos" ...
##  $ CurrentPD     : int  3 3 5 3 6 6 6 3 5 4 ...
##  $ Neighborhoods : int  53 53 21 37 106 106 106 53 32 50 ...
##  $ geoms         :sfc_POINT of length 39949; first list element:  'XY' Named num  -122.4 37.8
##   ..- attr(*, "names")= chr [1:2] "x" "y"
##  - attr(*, "sf_column")= chr "geoms"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:14] "OBJECTID" "Date" "RequestType" "Address" ...
head(sanfran)

Data Wrangling (5 out of 8 required)

Description: Students need not use every function and method introduced in STAT 184, but clear demonstration of proficiency should include proper use of 5 out of the following 8 topics from class: (+) various data verbs for general data wrangling like filter, mutate, summarise, arrange, group_by, etc. (+) joins for multiple data tables. (+) spread & gather to stack/unstack variables (+) regular expressions (+) reduction and/or transformation functions like mean, sum, max, min, n(), rank, pmin, etc. (+) user-defined functions (+) loops and control flow (+) machine learning

  1. .Rmd Line number(s) for general data wrangling:
library(sf)
## Warning: package 'sf' was built under R version 4.3.2
## Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
wrangled<-
  poop%>%
  mutate(URL=NULL,RequestType=NULL,source=NULL)%>%
  st_drop_geometry(df)

sanfranBR<-
  sanfran%>%
  mutate(uid=NULL, water_fountain=NULL, resource_type=NULL,bottle_filler=NULL,jug_filler=NULL, dog_fountain=NULL, notes=NULL, data_source=NULL,data_as_of=NULL,data_loaded_at=NULL)%>%
  st_drop_geometry(df)
  1. .Rmd Line number(s) for a join operation:
poopAtRestroom<-
  left_join(wrangled, sanfran, by=c('Neighborhood'='analysis_neighborhood'))
## Warning in left_join(wrangled, sanfran, by = c(Neighborhood = "analysis_neighborhood")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 1 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
head(poopAtRestroom)
  1. .Rmd Line number(s) for use of regular expressions: Note: part F has the regular expression
street_ending_finder=function(frame, street.col, street.ending){
  pattern <- paste(street.ending, "$", sep="")
  pattern
  Matches <- frame %>%
    filter(grepl(pattern = pattern, street.col, ignore.case = TRUE))
  return(Matches)
}

Matches <- street_ending_finder(poop, poop$Address, "ST")
Matches
  1. .Rmd Line number(s) for use of reduction and/or transformation functions:
x <- wrangled %>%
  select(Neighborhood) %>%
  group_by(Neighborhood) %>%
  summarise(poop_count=n())

y <- sanfran %>%
  select(analysis_neighborhood) %>%
  group_by(analysis_neighborhood) %>%
  summarise(restroom_count=n())

joined_table <- x %>% 
  full_join(y, join_by("Neighborhood" == "analysis_neighborhood"))

joined_table <- joined_table[!is.na(joined_table$Neighborhood),]
joined_table
  1. .Rmd Line number(s) for use of user-defined functions:
convert_num_to_date = function(date){
  hours_passed = (date %% 39630) * 2
  days = hours_passed %/% 24
  days
}

poop2 <- wrangled %>% mutate(dayReported=convert_num_to_date(wrangled$Date))
  1. .Rmd Line number(s) for a spread or gather operation (or equivalent):
poop2%>%
  group_by(dayReported,supervisor)%>%
  summarise(count=n())%>%
  pivot_wider(names_from=supervisor,values_from = count)
## `summarise()` has grouped output by 'dayReported'. You can override using the
## `.groups` argument.
  1. .Rmd Line number(s) for use of loops and/or control flow:
for (i in 1:length(joined_table$Neighborhood)){
  case <- joined_table[i,]
  for (j in 1:length(case)){
    variable <- joined_table[i,j]
    if (is.na(variable) ){
      joined_table[i,j] <- 0
    }
  }
}
  1. .Rmd Line number(s) for use of machine learning (not “wrangling” but scored here):

We want to explore if a restroom count in a neighborhood has an effect on poop count in a neighborhood. We will use linear regression to explore this.

The setup:

\(poop\_count_i \sim N(b_0 + b_1 \times restroom\_count_i, \sigma^2)\)

The following Hypothesis are used:

\[\begin{align*} H_0 &: b_1 = 0 \\ H_a &: b_1 \neq 0 \end{align*}\]

Approach taught in Stat 415, write a negative log likelihood function.

nll.regression = function(theta, poop_count, restroom_count){
  b0 = theta[1]
  b1 = theta[2]
  sigma_sq = theta[3]
  mean = b0 + b1 * restroom_count
  -sum(dnorm(poop_count, mean=mean, sd = sigma_sq, log = TRUE))
}

Here we use the optim function which will estimate the MLE for $ b_0, b_1$ and \(\sigma^2\)

out = optim(c(0, 0, 1), nll.regression, poop_count=joined_table$poop_count, restroom_count=joined_table$restroom_count, hessian=TRUE)


b1= out$par[2]
I = out$hessian

se = sqrt(diag(solve(I)))

Calculate the Wald-Test Statistic and check \(P(|Z^*| < 0.05)\)

Z = (b1 - 0)/se[2]

p_val = 2 * dnorm(-abs(Z))

p_val
## [1] 3.836861e-05

Since our p_value is less than 0.05, we can reject our null hypothesis and conclude that the number of restrooms in an area has an effect on the number of poops in that area

Data Visualization (3 of 4 required)

Description: Students need not use every function and method introduced in STAT 184, but clear demonstration of proficiency should include a range of useful of data visualizations that are (1) relevant to stated research question for the analysis, (2) include at least one effective display of many–at least 3–variables, and (3) include 3 of the following 5 visualization techniques learned in STAT 184: (+) use of multiple geoms such as points, density, lines, segments, boxplots, bar charts, histograms, etc (+) use of multiple aesthetics–not necessarily all in the same graph–such as color, size, shape, x/y position, facets, etc (+) layered graphics such as points and accompanying smoother, points and accompanying boxplots, overlaid density distributions, etc (+) decision tree and/or dendogram displaying machine learning model results

  1. .Rmd Line number(s) for use of mulitple different geoms:
    241,242,265,266

  2. .Rmd Line number(s) for use of multiple aesthetics:
    242,265,266

  3. .Rmd Line number(s) for use of layered graphics:
    241+242,265+266

  4. .Rmd Line number(s) for use of decision tree or dendogram results:

b0 = out$par[1]
b1 = out$par[2]

reg_func = function(b0, b1, x){
  b0 + b1 * x
}

ggplot(joined_table) + geom_point(aes(x = restroom_count, y = poop_count)) + 
  geom_line(aes(x = restroom_count, y = reg_func(b0, b1, restroom_count), color = "red"))

wrangledA <- data.frame(Neighborhood = sample(c("A", "B", "C", "D"), 100, replace = TRUE))

sanfranA <- data.frame(analysis_neighborhood = sample(c("A", "B", "C", "D"), 120, replace = TRUE))

x <- wrangled %>%
  select(Neighborhood) %>%
  group_by(Neighborhood) %>%
  summarise(poop_count = n())

y <- sanfranA %>%
  select(analysis_neighborhood) %>%
  group_by(analysis_neighborhood) %>%
  summarise(restroom_count = n())

joined_table <- x %>% full_join(y, by = c("Neighborhood" = "analysis_neighborhood"))


ggplot(joined_table, aes(x = Neighborhood)) +
  geom_bar(aes(y = poop_count, fill = "Poop Count"), stat = "identity", position = "dodge") +
  geom_bar(aes(y = restroom_count, fill = "Restroom Count"), stat = "identity", position = "dodge") +
  labs(title = "Poop and Restroom Counts in San Francisco Neighborhoods",
       x = "Neighborhood",
       y = "Count",
       fill = "Legend") +
  scale_fill_manual(values = c("Poop Count" = "#FF5733", "Restroom Count" = "#33B5E5")) +
  theme_minimal() +
  theme(legend.position = "top", legend.title = element_blank()) +
  scale_y_continuous(expand = expansion(mult = c(0.1, 0.2)))
## Warning: Removed 4 rows containing missing values (`geom_bar()`).
## Warning: Removed 117 rows containing missing values (`geom_bar()`).

#devtools::install_github("rstudio/leaflet")

library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.2
Poop_Map <- 
   leaflet() %>%
  addTiles() %>%
   addMarkers(clusterOptions = markerClusterOptions(), data=poop) %>%
   addCircleMarkers(radius=2, color="red", data=sanfran) %>%
   setView(-122.44, 37.76849, zoom=12)%>% 
   addLegend(position = "bottomleft",
                         colors = c("red", "orange"),
                         labels = c("Public_Restrooms", "Poop Clusters are numbered"),
                         title = "Marker Categories")
## Assuming "longitude" and "latitude" are longitude and latitude, respectively
Poop_Map

Other requirements (Nothing for you to report in this Guidance Document)

  1. All data visualizations must be relevant to the stated research question, and the report must include at least one effective display of many–at least 3–variables

  2. Code quality: Code formatting is consistent with Style Guide Appendix of DataComputing eBook. Specifically, all code chunks demonstrate proficiency with (1) meaningful object names (2) proper use of white space especially with respect to infix operators, chain operators, commas, brackets/parens, etc (3) use of <- assignment operator throughout (4) use of meaningful comments.

  3. Narrative quality: The narrative text (1) clearly states one research question that motivates the overall analysis, (2) explains reasoning for each significant step in the analysis and it’s relationship to the research question, (3) explains significant findings and conclusions as they relate to the research question, and (4) is completely free of errors in spelling and grammar

  4. Overall Quality: Submitted project shows significant effort to produce a high-quality and thoughtful analysis that showcases STAT 184 skills. (2) The project must be self-contained, such that the analysis can be entirely rerun without errors. (3) Analysis is coherent, well-organized, and free of extraneous content such as data dumps, unrelated graphs, and other content that is not overtly connected to the research question.

  5. EXTRA CREDIT (1) Project is submitted as a self-contained GitHub Repo (2) project submission is a functioning github.io webpage generated for the project Repo. Note: a link to the GitHub Repo itself will be awarded partial credit, but does not itself qualify as a “webpage” of the analysis.